Semantic Embedding Space for Zero-Shot Action Recognition
The number of categories for action recognition is growing rapidly. It is
thus becoming increasingly hard to collect sufficient training data to learn
conventional models for each category. This issue may be ameliorated by the
increasingly popular 'zero-shot learning' (ZSL) paradigm. In this framework a
mapping is constructed between visual features and a human interpretable
semantic description of each category, allowing categories to be recognised in
the absence of any training data. Existing ZSL studies focus primarily on image
data, and attribute-based semantic representations. In this paper, we address
zero-shot recognition in contemporary video action recognition tasks, using
semantic word vector space as the common space to embed videos and category
labels. This is more challenging because the mapping between the semantic space
and the space-time features of videos containing complex actions is more
intricate and harder to learn. We demonstrate that a simple self-training and data
augmentation strategy can significantly improve the efficacy of this mapping.
Experiments on human action datasets including HMDB51 and UCF101 demonstrate
that our approach achieves state-of-the-art zero-shot action recognition
performance.
Comment: 5 pages
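As a rough illustration of the visual-to-semantic mapping this abstract describes, the sketch below fits a ridge regression from visual features to a word-vector space and labels unseen-class samples by the nearest class prototype under cosine similarity. It is a minimal sketch under stated assumptions, not the authors' implementation: the choice of ridge regression, the synthetic data, and all names (`fit_ridge_mapping`, `zero_shot_predict`, `make_toy_data`) are hypothetical, and the self-training and data augmentation steps mentioned in the abstract are not shown.

```python
import numpy as np


def fit_ridge_mapping(X, Z, lam=1.0):
    """Closed-form ridge regression: W minimising ||XW - Z||^2 + lam * ||W||^2.

    X: (n, vis_dim) visual features of seen-class samples.
    Z: (n, sem_dim) word vector of each sample's class label.
    """
    d = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ Z)


def l2_normalise(A):
    """Scale each row to unit Euclidean length."""
    return A / np.linalg.norm(A, axis=1, keepdims=True)


def zero_shot_predict(X_test, W, prototypes):
    """Project test features into the semantic space and return, for each
    sample, the index of the most cosine-similar unseen-class prototype."""
    projected = l2_normalise(X_test @ W)
    sims = projected @ l2_normalise(prototypes).T
    return sims.argmax(axis=1)


def make_toy_data(n_per_class=20, sem_dim=4, vis_dim=10,
                  n_seen=6, n_unseen=2, noise=0.0, seed=0):
    """Synthetic stand-in for real data (an assumption for illustration only):
    visual features are a random linear image of the class word vector."""
    rng = np.random.default_rng(seed)
    n_classes = n_seen + n_unseen
    prototypes = rng.normal(size=(n_classes, sem_dim))
    A = rng.normal(size=(sem_dim, vis_dim))
    labels = np.repeat(np.arange(n_classes), n_per_class)
    X = prototypes[labels] @ A + noise * rng.normal(size=(labels.size, vis_dim))
    seen = labels < n_seen
    return (X[seen], labels[seen],
            X[~seen], labels[~seen] - n_seen,
            prototypes[:n_seen], prototypes[n_seen:])


if __name__ == "__main__":
    X_tr, y_tr, X_te, y_te, P_seen, P_unseen = make_toy_data(noise=0.1)
    # Train only on seen classes; unseen classes contribute word vectors only.
    W = fit_ridge_mapping(X_tr, P_seen[y_tr], lam=1e-2)
    pred = zero_shot_predict(X_te, W, P_unseen)
    print("zero-shot accuracy on toy data:", (pred == y_te).mean())
```

On real video data the visual features would come from a space-time descriptor and the prototypes from pretrained word vectors of the class names; both are replaced by random data here, so the toy accuracy says nothing about HMDB51 or UCF101.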
Chinese loans in Old Vietnamese with a sesquisyllabic phonology
While consonant clusters, taken broadly to include presyllables, are commonly hypothesized for Old Chinese, little direct evidence is available for establishing the early forms of specific words. This essay examines a hitherto overlooked source: Old Vietnamese, a language substantially attested in a single document, which writes certain words, monosyllabic in modern Vietnamese, in an orthography suggesting sesquisyllabic phonology. For a number of words loaned from Chinese, Old Vietnamese provides the only testimony of the form of the Vietic borrowing. The small list of currently known sesquisyllabic words of Chinese origin attested in this document includes examples of both words with a secure initial Chinese cluster and words with plausible Vietic-internal prefixation.
Positional Label for Self-Supervised Vision Transformer
Positional encoding is important for a vision transformer (ViT) to capture the
spatial structure of the input image, and its general effectiveness in ViT has
been demonstrated. In this work we propose to train ViT to recognize the
positional labels of patches of the input image; this apparently simple task
yields a meaningful self-supervisory signal. Based on previous work on ViT positional
encoding, we propose two positional labels dedicated to 2D images including
absolute position and relative position. Our positional labels can be easily
plugged into various current ViT variants. They can work in two ways: (a) as an
auxiliary training target for vanilla ViT (e.g., ViT-B and Swin-B) to improve
performance; (b) combined with a self-supervised ViT (e.g., MAE) to provide a more
powerful self-supervised signal for semantic feature learning. Experiments
demonstrate that, with the proposed self-supervised methods, ViT-B and Swin-B
gain improvements of 1.20% and 0.74% in top-1 accuracy on ImageNet,
respectively, and improvements of 6.15% and 1.14% on Mini-ImageNet.
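To make the idea of a positional label concrete, the sketch below splits an image into patches, assigns each patch an absolute label (its row-major index in the patch grid), and scores position predictions with a cross-entropy loss that could be added to a main objective as an auxiliary term. It is an illustrative sketch only: the abstract does not define the label encodings, so `relative_position_labels` (an offset from a chosen reference patch) is one hypothetical choice, and all function names and the NumPy-only setting are assumptions rather than the authors' code.

```python
import numpy as np


def patchify(image, patch):
    """Split an (H, W, C) image into non-overlapping patches in row-major
    order. Returns (num_patches, patch*patch*C) and the grid shape."""
    H, W, C = image.shape
    assert H % patch == 0 and W % patch == 0, "image size must divide by patch"
    gh, gw = H // patch, W // patch
    x = image.reshape(gh, patch, gw, patch, C).transpose(0, 2, 1, 3, 4)
    return x.reshape(gh * gw, patch * patch * C), (gh, gw)


def absolute_position_labels(grid):
    """Absolute label: the patch's row-major index in the grid."""
    gh, gw = grid
    return np.arange(gh * gw)


def relative_position_labels(grid, ref):
    """Hypothetical relative label: the (row, col) offset of each patch from
    reference patch `ref`, folded into one class id in
    [0, (2*gh - 1) * (2*gw - 1))."""
    gh, gw = grid
    rows, cols = np.divmod(np.arange(gh * gw), gw)
    ref_r, ref_c = divmod(ref, gw)
    dr = rows - ref_r + (gh - 1)  # shifted so the offset is non-negative
    dc = cols - ref_c + (gw - 1)
    return dr * (2 * gw - 1) + dc


def cross_entropy(logits, labels):
    """Mean softmax cross-entropy over patches (numerically stabilised)."""
    logits = logits - logits.max(axis=1, keepdims=True)
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -log_probs[np.arange(len(labels)), labels].mean()


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    image = rng.normal(size=(32, 32, 3))
    patches, grid = patchify(image, patch=8)
    labels = absolute_position_labels(grid)
    # Stand-in for a position head on top of ViT patch tokens (assumption).
    head = rng.normal(size=(patches.shape[1], labels.size)) * 0.01
    aux_loss = cross_entropy(patches @ head, labels)
    print("patches:", patches.shape, "auxiliary position loss:", aux_loss)
```

In an actual training loop the logits would come from a small head on the transformer's patch tokens, and the auxiliary loss would be weighted and summed with the classification or reconstruction loss; the weighting is not specified in the abstract.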